
    Small Area Estimation under Limited Auxiliary Population Data Dealing with Model Violations and their Economic Applications

    For evidence-based policy-making, reliable information on socio-economic indicators is essential. Sample surveys have a long tradition of providing cost-efficient information on these indicators. Typically, the quantity of interest is required not only at the level of the total population but especially at the level of sub-populations (geographic areas or sociodemographic groups), called areas or domains. To gain insights into these sub-populations, disaggregated direct estimators can be used, which are calculated solely on area-specific survey data. An area is regarded as ‘large’ if the sample size is large enough to enable reliable direct estimates. If the precision of the direct estimates is not sufficient or the sample size is even zero, the area is considered ‘small’. This is particularly common at high spatial or socio-demographic resolutions. Small area estimation (SAE) promises to overcome this problem without the need for larger and thus more costly surveys. The essence of SAE techniques is that they ‘borrow strength’ from other areas to improve their predictions. For this purpose, a model is built on survey data that links additional auxiliary data and exploits area-specific structures. Suitable auxiliary data sources are administrative and register data, such as the census. In many countries, such data are strictly protected by confidentiality agreements, and access to population micro-data is a challenge even for gatekeeper organisations. Thus, users have an increased interest in SAE estimators that do not require population micro-data to serve as auxiliary data. In this thesis, new methods in the absence of population micro-data are presented and applications to highly relevant socio-economic indicators are demonstrated.

    Since different SAE models impose different data requirements, Part I bundles research combining unit-level survey data and limited auxiliary data, e.g., aggregated data such as means, which is a common data situation for users. To account for the unit-level survey information, the use of the well-known nested error regression (NER) model is targeted. This model is a special case of a linear mixed model based on several assumptions. But how can users proceed if the model assumptions are not fulfilled? In Part I, this thesis provides two new approaches to deal with this issue. One promising approach is to transform the response. Since several socio-economically relevant variables, such as income, have a skewed distribution, the log-transformation of the response is an established way to meet the assumptions. However, the data-driven log-shift transformation is even more promising because it extends the log by an additional parameter and achieves more flexibility. Chapter 1 introduces both transformations in the absence of population micro-data. A particular challenge is the transformation of the small area means back to the original scale. Hence, the proposed approach introduces aggregate statistics (means and covariances) and kernel density estimation to resolve the issue of lacking population micro-data. Uncertainty estimation is developed, and all methods are evaluated in design- and model-based settings. The proposed method is applied to estimate regional income in Germany using the Socio-Economic Panel and census data. It achieves a clear improvement in reliability, and thus demonstrates the importance of the method. To conveniently enable further applications, this new methodology is implemented in the R package saeTrafo. Chapter 2 describes the various functionalities of the package using publicly available income data. To increase user-friendliness, established unit-level models under transformations and their uncertainty estimations are implemented, and the most suitable method is automatically selected.

    For some applications, however, it is challenging to find a suitable transformation or, more generally, to specify a model, particularly in the presence of complex interactions. For this case, machine learning methods are valuable, as a transformation is not necessarily required and a model does not need to be explicitly specified. The semi-parametric framework of mixed effects random forest (MERF) combines the advantages of random forests (robustness against outliers and implicit model selection) with the ability to model hierarchical dependencies as present in SAE approaches. Chapter 3 introduces MERFs in the absence of population micro-data. As existing random forest algorithms require unit-level auxiliary population data, an alternative strategy is introduced. It adaptively incorporates aggregated auxiliary information through calibration weights to circumvent unit-level auxiliary data. Applying the proposed method to opportunity costs of care work for Germany using the Socio-Economic Panel and census data demonstrates the gain in accuracy in comparison to both direct estimates and the classical NER model.

    In contrast to methods using a unit-level sample survey, Part II focuses on the well-known class of area-level SAE models, which require direct estimates from a survey while using (once again) only aggregated population auxiliary data. This thesis presents two particularly relevant applications of this model class. Chapter 4 examines regional consumer price indices (CPIs) in the United Kingdom (UK), contributing to the great interest in monitoring inflation at the spatial level. The SAE challenge is to construct model-based expenditure weights to generate the regional basket of goods and services for the twelve regions of the UK. They are estimated and constructed from the Living Costs and Food Survey. Furthermore, available price data are linked to the SAE-estimated baskets to produce regional CPIs. The resulting CPI series are closely examined, and smoothing techniques are applied. As a result, the reliability improves, but the CPI series are still too volatile for policy use. However, our research serves as a valuable framework for the creation of a regional CPI in the future.

    The second application also explores the reliability of the disaggregated estimation of a politically and economically highly relevant indicator, in this case the unemployment rate. The regional target levels are the functional urban areas in the German federal state of North Rhine-Westphalia. In Chapter 5, two types of unemployment rates, the traditional one and an alternative definition taking commuting into account, are estimated and compared. Direct estimates from the labour force survey are linked with SAE methods to passively collected mobile network data. This alternative data source is available in real time, offers spatially flexible resolutions, and is dynamic. In compliance with data protection rules, we obtain aggregated auxiliary mobile network information from the data provider. The SAE methods improve the reliability, and the resulting predictions show that alternative unemployment rates in German city cores are lower than traditionally estimated official unemployment rates indicate.

    Estimating regional unemployment with mobile network data for Functional Urban Areas in Germany

    The ongoing growth of cities due to better job opportunities is leading to increased labour-related commuter flows in several countries. On the one hand, an increasing number of people commute and move to the cities, but on the other hand, the labour market indicates higher unemployment rates in urban areas than in the surrounding areas. We investigate this phenomenon on regional level by an alternative definition of unemployment rates in which commuting behaviour is integrated. We combine data from the Labour Force Survey (LFS) with dynamic mobile network data by small area models for the federal state North Rhine-Westphalia in Germany. From a methodical perspective, we use a transformed Fay-Herriot model with bias correction for the estimation of unemployment rates and propose a parametric bootstrap for the Mean Squared Error (MSE) estimation that includes the bias correction. The performance of the proposed methodology is evaluated in a case study based on official data and in model-based simulations. The results in the application show that unemployment rates (adjusted by commuters) in German cities are lower than traditional official unemployment rates indicate.
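
    The shrinkage idea behind a Fay-Herriot model can be sketched in a few lines. This is only an illustration of the standard, untransformed area-level predictor with the model variance assumed known; it does not reproduce the transformation, the bias correction, or the bootstrap MSE estimation described in the abstract. The function name and all numbers are hypothetical.

```python
def fay_herriot_blend(direct, synthetic, sampling_var, model_var):
    """Blend a direct survey estimate with a regression-synthetic estimate.

    gamma = model_var / (model_var + sampling_var) is the shrinkage factor:
    the noisier the direct estimate (large sampling_var), the more weight
    moves to the synthetic, model-based part.
    """
    if sampling_var < 0 or model_var < 0:
        raise ValueError("variances must be non-negative")
    if sampling_var + model_var == 0:
        raise ValueError("at least one variance must be positive")
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical area: direct unemployment rate 10%, synthetic estimate 6%.
precise_area = fay_herriot_blend(0.10, 0.06, sampling_var=0.0001, model_var=0.0004)
noisy_area = fay_herriot_blend(0.10, 0.06, sampling_var=0.0040, model_var=0.0004)
print(precise_area, noisy_area)
```

    In this sketch the area with the larger sampling variance is pulled more strongly towards the synthetic estimate, which is the sense in which area-level models borrow strength from auxiliary data.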

    Estimating regional income indicators under transformations and access to limited population auxiliary information

    Spatially disaggregated income indicators are typically estimated by using model-based methods that assume access to auxiliary information from population micro-data. In many countries, such as Germany and the UK, population micro-data are not publicly available. In this work we propose small area methodology for the case in which only aggregate population-level auxiliary information is available. We use data-driven transformations of the response to satisfy the parametric assumptions of the models used. In the absence of population micro-data, appropriate bias corrections for small area prediction are needed. Under the approach we propose in this paper, aggregate statistics (means and covariances) and kernel density estimation are used to resolve the issue of not having access to population micro-data. We further explore the estimation of the mean squared error using the parametric bootstrap. Extensive model-based and design-based simulations are used to compare the proposed method to alternative methods. Finally, the proposed methodology is applied to the 2011 Socio-Economic Panel and aggregate census information from the same year to estimate the average income for 96 regional planning regions in Germany.
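
    Why a bias correction is needed after transforming the response can be seen in a small sketch. It assumes the usual log-shift form log(y + shift) with a fixed, hypothetical shift parameter and made-up incomes; it only demonstrates the back-transformation bias and is not the correction proposed in the paper.

```python
import math


def log_shift(y, shift):
    """Log-shift transformation log(y + shift); requires y + shift > 0."""
    if y + shift <= 0:
        raise ValueError("y + shift must be positive")
    return math.log(y + shift)


def inverse_log_shift(z, shift):
    """Map a value on the transformed scale back to the original scale."""
    return math.exp(z) - shift


# Hypothetical skewed incomes and a hypothetical shift parameter.
incomes = [800.0, 1200.0, 1500.0, 2100.0, 9000.0]
shift = 250.0

true_mean = sum(incomes) / len(incomes)

# Average on the transformed scale, then back-transform naively.
transformed = [log_shift(y, shift) for y in incomes]
mean_transformed = sum(transformed) / len(transformed)
naive_mean = inverse_log_shift(mean_transformed, shift)

# The naive value is the geometric mean of (y + shift) minus the shift,
# which cannot exceed the arithmetic mean of y (Jensen's inequality).
print(true_mean, naive_mean)
```

    Because the logarithm is concave, back-transforming a mean computed on the transformed scale understates the mean on the original scale, so some correction is required when the target is an area mean.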

    Association of genetic variation with systolic and diastolic blood pressure among African Americans: the Candidate Gene Association Resource study

    The prevalence of hypertension in African Americans (AAs) is higher than in other US groups; yet, few have performed genome-wide association studies (GWASs) in AA. Among people of European descent, GWASs have identified genetic variants at 13 loci that are associated with blood pressure. It is unknown if these variants confer susceptibility in people of African ancestry. Here, we examined genome-wide and candidate gene associations with systolic blood pressure (SBP) and diastolic blood pressure (DBP) using the Candidate Gene Association Resource (CARe) consortium consisting of 8591 AAs. Genotypes included genome-wide single-nucleotide polymorphism (SNP) data utilizing the Affymetrix 6.0 array with imputation to 2.5 million HapMap SNPs and candidate gene SNP data utilizing a 50K cardiovascular gene-centric array (ITMAT-Broad-CARe [IBC] array). For Affymetrix data, the strongest signal for DBP was rs10474346 (P = 3.6 × 10⁻⁸) located near GPR98 and ARRDC3. For SBP, the strongest signal was rs2258119 in C21orf91 (P = 4.7 × 10⁻⁸). The top IBC association for SBP was rs2012318 (P = 6.4 × 10⁻⁶) near SLC25A42 and for DBP was rs2523586 (P = 1.3 × 10⁻⁶) near HLA-B. None of the top variants replicated in additional AA (n = 11 882) or European-American (n = 69 899) cohorts. We replicated previously reported European-American blood pressure SNPs in our AA samples (SH2B3, P = 0.009; TBX3-TBX5, P = 0.03; and CSK-ULK3, P = 0.0004). These genetic loci represent the best evidence of genetic influences on SBP and DBP in AAs to date. More broadly, this work supports the notion that blood pressure among AAs is a trait with genetic underpinnings but also with significant complexity.

    Experimental UK Regional Consumer Price Inflation with Model-Based Expenditure Weights

    Like many other countries, the United Kingdom (UK) produces a national consumer price index (CPI) to measure inflation. Presently, CPI measures are not produced for regions within the UK. It is believed that, using only available data sources, a regional CPI would not be precise or reliable enough as an official statistic, primarily because the regional partitioning of the data makes sample sizes too small. We investigate this claim by producing experimental regional CPIs using publicly available price data, and deriving expenditure weights from the Living Costs and Food survey. We detail the methods and challenges of developing a regional CPI and evaluate its reliability. We then assess whether model-based methods such as smoothing and small area estimation significantly improve the measures. We find that a regional CPI can be produced with available data sources; however, it appears to be excessively volatile over time, mainly due to the weights. Smoothing and small area estimation improve the reliability of the regional CPI series to some extent, but they remain too volatile for regional policy use. This research provides a valuable framework for the development of a more viable regional CPI measure for the UK in the future.

    Construction of regional consumer price indices using small area estimation

    Consumer Price Indices (CPI) are used in many ways by the government, businesses, and society in general. They can affect interest rates, tax allowances, wages, state benefits, and many other payments. The CPI is a fixed (national) basket index, where a range of goods and services is priced each month, and the expenditure shares on items in the basket are used to weight the price information together. The starting point for a regional price index should be a regional basket of goods and services. In the current poster, we derive regional baskets from the UK Living Costs and Food Survey (LCF), taking the products (COICOP classification) with the largest proportion of expenditures. As the sample size is naturally much smaller for regions, the accuracy of the direct estimates on the basket will be reduced. To overcome this problem, one possibility discussed in the poster is to pool multiple years of LCF data to increase the sample size. Another is to consider small area estimation approaches for the regional basket. Ideally, the small area estimates would be constrained to the overall expenditure total. Therefore, we assess some benchmarking approaches. Since the conceptual frameworks of CPI calculation for the UK and Germany do not differ much, the presented methodology can also be adapted for the calculation of regional CPIs for Germany.
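
    The way expenditure shares weight price information together can be sketched as a weighted average of price relatives. This is a simplified illustration under the assumption of a single aggregation level; the item names, shares, and price relatives are hypothetical, and official CPI compilation involves further steps that are not shown.

```python
def basket_index(expenditure_shares, price_relatives):
    """Fixed-basket index: 100 times the share-weighted mean of price relatives.

    expenditure_shares: item -> expenditure share (normalised here to sum to 1)
    price_relatives:    item -> current price divided by base-period price
    """
    if set(expenditure_shares) != set(price_relatives):
        raise ValueError("shares and price relatives must cover the same items")
    total = sum(expenditure_shares.values())
    if total <= 0:
        raise ValueError("expenditure shares must sum to a positive value")
    weighted = sum(
        expenditure_shares[item] / total * price_relatives[item]
        for item in expenditure_shares
    )
    return 100.0 * weighted


# Hypothetical regional basket with three item groups.
shares = {"food": 0.5, "housing": 0.3, "transport": 0.2}
relatives = {"food": 1.10, "housing": 1.00, "transport": 1.05}
regional_index = basket_index(shares, relatives)
print(regional_index)
```

    In such a sketch, noisy regional expenditure shares translate directly into a noisy index, which is why the estimation of the regional basket matters.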

    Deliverable 3.2 - Guidelines for best practices implementation for transferring methodology

    Today, big data is a buzzword. Although there have been attempts to properly define the term, a universally accepted definition has not yet been given. Accordingly, many different types of data may be classified as big data or new data. These range from scanner data collected at retail outlets, through remote sensing data, to mobile phone data. As the availability of such data increases, researchers try to make use of them by incorporating them into existing methods and developing new methods. These developments are also highly relevant for the estimation of well-being indicators, a core focus of the MAKSWELL project. The combination of new data sources and new or modified methods is promising, especially where the estimation of well-being at a fine spatial resolution is concerned. While a comprehensive survey of the related literature and available data sets is out of the scope of this project, this deliverable collects a few (experimental) applications that shed light on the potential benefits of these new approaches. Some drawbacks and practical implementation problems are addressed as well. Taken as a whole, the presented set of applications points to future research needs in the area and allows the derivation of some general best practice guidelines that can also inform other subject matter areas besides the measurement of poverty and well-being.